This blog post is part 3 of a three-part series. See part 1 for retrieving the dataset and part 2 for calculating the similarity between test cases.
In the previous blog post, we saw how to calculate the structural (dis-)similarity between test cases based on the invoked production methods. We manually spotted some test cases that were very similar by searching through the whole dataset, which can get very tedious. So in this blog post, I show how we can partly automate the identification of groups of similar test cases as well as how we can visualize those groups. The aim is to find test cases that test the same production code but shouldn't.
Note: Although we use a somewhat artificial dataset based on pure unit tests (for simplicity), this kind of data analysis is a very powerful way to spot duplication among long-running end-to-end tests that were written e.g. with the Selenium browser automation framework.
Let's first read in the data from the previous (dis-)similarity calculation with Pandas and have a look at it. Because we have a dataset with multi-level indexes and columns, we have to specify this accordingly with the index_col and header parameters.
In [1]:
import pandas as pd
distance_df = pd.read_excel(
    "datasets/test_distance_matrix.xlsx",
    index_col=[0,1],
    header=[0,1])
# show only subset of data
distance_df.iloc[:5,:2]
Out[1]:
Our dataset contains the cosine distances between all unit test cases (test_method) with respect to the production methods they call. That means if two test cases invoke exactly the same production methods, the distance is 0. If they share only a few calls to the same production methods, the distance is somewhere between 0 and 1, depending on how many calls they have in common. If the test cases call completely different production methods, the distance is 1.
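As a quick refresher (this toy example is not part of the analyzed dataset; the call-count vectors are made up), this is how the cosine distance behaves for identical, partly overlapping and disjoint call profiles:
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
# made-up call counts: rows = test cases, columns = production methods
calls = np.array([
    [2, 1, 0],   # test A
    [2, 1, 0],   # test B: exactly the same calls as test A -> distance 0
    [1, 1, 0],   # test C: partly overlapping calls -> distance between 0 and 1
    [0, 0, 3]])  # test D: completely different calls -> distance 1
cosine_distances(calls)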
We've spotted some interesting test cases manually and discussed them in detail in the previous blog post. Now we want to do this in a more automated way. Let's first visualize the data to see what we've achieved so far.
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=[10,8])
sns.heatmap(
    distance_df,
    xticklabels=False,
    yticklabels=False)
Out[2]:
This heat map gives us a quick graphical overview of the patterns in the whole dataset. Black spots show groups of test methods that call the same production methods, while light pink areas signal disjoint calls to the production code. The entries of the heat map are ordered alphabetically by the class names of the test methods, and those class names tend to begin with the same prefix (like Comment or Todo).
Let's have a closer look at the upper left corner of the heat map. It shows the first ten entries of the matrix with the test classes AddCommentTest and AddSchedulingDateTest.
In [3]:
sns.heatmap(distance_df.iloc[:10,:10])
Out[3]:
Discussion
The color for the distances between the test methods of the AddCommentTest class is deep dark. That means the tests in this test class are absolutely similar regarding their structure. If our goal were to reduce test code, we could think about merging some test cases or using some kind of parametrized test execution to avoid duplication (see the sketch below).
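Just to illustrate the idea of parametrized tests (the project under analysis is Java, so this is purely a sketch in Python/pytest with a minimal stand-in class, not code from the analyzed project):
import pytest
# minimal stand-in for the production class (purely for illustration)
class Todo:
    def __init__(self):
        self.comments = []
    def add_comment(self, text):
        self.comments.append(text)
# one parametrized test instead of several near-identical test methods
@pytest.mark.parametrize("comment_text", [
    "a short comment",
    "another comment",
    "a very loooong comment",
])
def test_add_comment_stores_comment(comment_text):
    todo = Todo()
    todo.add_comment(comment_text)
    assert comment_text in todo.comments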
In contrast, the AddSchedulingDateTest shows a more diverse coloring: The test methods addDateToScheduling and addTwoDatesToScheduling are structurally almost identical (given that one test just adds another date, this makes perfect sense). The more orange colored test failsIfSchedlindIdIsNotExisting could be a candidate (besides the typo in its name) for further investigation because it differs quite a bit from the other test cases. Maybe this test case can be moved to a more dedicated test class (one for checking the correct generation of ids in our case).
Unfortunately, the 422x422 distance matrix distance_df isn't a very efficient way to spot similarities. There are areas of similar tests that don't occur along the diagonal. Fortunately, there are many ways to improve this situation.
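One quick option (just a sketch on the side; we take a different route below) is seaborn's clustermap, which reorders rows and columns by hierarchical clustering so that similar test methods end up next to each other:
# reorder rows and columns by hierarchical clustering of the distance matrix
sns.clustermap(
    distance_df,
    xticklabels=False,
    yticklabels=False,
    figsize=(10, 10))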
In this blog post, we want to break down the multidimensional result into a two-dimensional representation using multidimensional scaling (MDS). MDS tries to find a representation of our 422-dimensional dataset in a two-dimensional space while retaining the distance information between all data points (= test methods). We can use the machine learning library scikit-learn, which provides an implementation of multidimensional scaling out of the box.
Pandas' DataFrame integrates very nicely with the MDS module of scikit-learn, too. We just have to declare that we want to use our precomputed dissimilarity matrix distance_df as the distance information. We can then let MDS figure out a suitable two-dimensional representation of our dataset by using the fit_transform method.
In [4]:
from sklearn.manifold import MDS
# uses a fixed seed for random_state for reproducibility
model = MDS(dissimilarity='precomputed', random_state=10)
# this could take some seconds
distance_df_2d = model.fit_transform(distance_df)
distance_df_2d[:5]
Out[4]:
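As a quick sanity check (not part of the original analysis), the fitted scikit-learn model exposes the final stress value of the embedding, which gives a rough idea of how well the pairwise distances were preserved:
# lower stress means the 2D embedding preserves the original distances better;
# the absolute value depends on the scale and number of the distances
model.stress_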
Next, we plot the now two-dimensional matrix with matplotlib. We colorize all data points according to the names of the test classes (= the first level of distance_df's index). We achieve this by mapping each test class to a number between 0 and 1 (relative_index) and drawing a color from a predefined color spectrum (cm.hsv in this case) for each of these numbers. With this, each test class gets its own color, which enables us to quickly reason about test classes that belong together structurally.
In [5]:
%matplotlib inline
from matplotlib import cm
import matplotlib.pyplot as plt
# brew some colors: map each test class (first index level) to a number between 0 and 1
relative_index = distance_df.index.labels[0] / distance_df.index.labels[0].max()
# note: in newer pandas versions, use distance_df.index.codes[0] instead of .labels[0]
colors = [x for x in cm.hsv(relative_index)]
# plot the 2D matrix with colors
plt.figure(figsize=(8,8))
x = distance_df_2d[:,0]
y = distance_df_2d[:,1]
plt.scatter(x, y, c=colors)
Out[5]:
Discussion
We now have the visual information about test methods that call similar production code in 2D.
With this representation, we can have a look at the various groups to check if the groupings are OK or if we have to restructure some test cases because of too much similarity or confusion of responsibilities.
Let's quickly find these groupings programmatically by using another machine learning technique: density-based clustering! With this technique, we can automatically find data points that are very close together. Again, we can use scikit-learn, this time with its DBSCAN implementation, to identify data points that are close to each other. We then add this information to the plot above to visualize dense groups of data.
For the parameters eps (~ the maximal distance between two data points so that they are still treated as neighbors) and min_samples (~ the minimal number of neighboring data points needed to form a cluster), we choose suitable values in an iterative manner until we get the groupings that we would otherwise have identified visually.
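One way to make this iteration a bit more systematic is a small parameter sweep (just a sketch, not part of the original notebook; the eps values are arbitrary examples):
from sklearn.cluster import DBSCAN
# try a few eps values and look at the resulting number of clusters and noise points
for eps in [0.02, 0.05, 0.08, 0.12]:
    labels = DBSCAN(eps=eps, min_samples=10).fit(distance_df_2d).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = (labels == -1).sum()
    print("eps={:.2f}: {} clusters, {} noise points".format(eps, n_clusters, n_noise))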
In [6]:
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.08, min_samples=10)
clustering_results = dbscan.fit(distance_df_2d)
clustering_results
Out[6]:
We plot all data points of our distance matrix together with the core members of the found clusters (components_) in one scatter plot.
In [7]:
plt.figure(figsize=(8,8))
cluster_members = clustering_results.components_
# plot all data points
plt.scatter(x, y, c='k', alpha=0.2)
# plot cluster members
plt.scatter(
    cluster_members[:,0],
    cluster_members[:,1],
    c='r', s=100, alpha=0.1)
Out[7]:
The scatter plot confirms what we've already seen with our own eyes: There are some groupings that belong together by forming a dense cluster. We can access these data points e.g. by their cluster labels (labels_) and throw away all non-clustered data points, whose label value is -1:
In [8]:
clustered_tests = pd.DataFrame(index=distance_df.index)
clustered_tests['cluster'] = clustering_results.labels_
cohesive_tests = clustered_tests[clustered_tests.cluster != -1]
cohesive_tests.head()
Out[8]:
We can now take a look at various metrics like the number of test classes that declare those methods (nunique) and the number of cluster members aka test methods (count).
In [9]:
test_methods_and_classes_per_cluster = \
    cohesive_tests.reset_index() \
    .groupby("cluster").test_type \
    .agg(["nunique", "count"])
test_methods_and_classes_per_cluster.head()
Out[9]:
We can also see which test classes belong to a cluster.
In [10]:
test_classes = cohesive_tests.reset_index().groupby("cluster").test_type.apply(set)
test_classes
Out[10]:
If we join both DataFrames, we get a nice summary of clusters with test classes we should have a deeper look into.
In [11]:
test_analysis_result = test_methods_and_classes_per_cluster.join(test_classes)
test_analysis_result
Out[11]:
For a more actionable representation of our findings, let's print the results in a good old, console-like way.
In [12]:
def print_results(series):
    print(
        "Cluster {} contains {} test methods in {} test classes."\
        .format(series.name, series['count'], series['nunique']))
    print(" The test classes are:")
    for test_class in series['test_type']:
        print(" - {}".format(test_class))
    print("-"*60)
test_analysis_result.apply(print_results, axis=1);
Discussion
With this list, we see e.g. that there is some mixing in cluster 0. Based on the test classes' names alone, we can see that the test methods test the same production code but are clearly not alike regarding their functionality. This is a cluster that needs some refactoring. We should also have a deeper look into clusters 4 and 6 because the contained test classes potentially test the same production code, which hints at overlapping responsibilities between those test classes.
What a trip! We started from a dataset that showed us the invocations of production methods by test methods (using jQAssistant) in the first part. We then worked our way through three mathematical / machine learning techniques: cosine_distances (in the second part) as well as MDS and DBSCAN in this post. Finally, we found out which different test classes test the same production code. The result is a helpful starting point for reorganizing test cases.
In general, we saw how we can transform software-specific problems into questions that can be answered with standard Data Science tooling. Knowing that this is possible opens up the way to a more data-centric way of thinking that provides real, actionable insights instead of pure guesswork.